We ended the last class starting to think about models with categorical predictors
We’ll recap those last few slides on ANOVA
Then we'll look at categorical predictors more generally, including the concept of interactions
Analysis of Variance
Analysis of Variance (ANOVA) is the classic name for a Gaussian linear model where the predictor (explanatory) variables are categorical
Earlier, we used the ANOVA table to partition the variance in \(y\) into components explained by \(x_j\) & a residual component not explained by the regression model
A slightly more restricted view of ANOVA is that it is a technique for partitioning the variation in \(y\) into that explained by one or more categorical predictor variables
The categories of each factor are the groups or experimental treatments
Analysis of Variance
ANOVA considers the different sources of variation that might arise in a data set
Of particular interest are the differences in the mean value of \(y\) between groups
We can think of within-group and between-group variances
Between-group variance is that due to the treatment (group) effects
Within-group variance is that due to the variability of individuals & measurement error
There will always be variation between individuals, but is this within-group variance large or small relative to the variance between groups?
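A minimal sketch of the between-group versus within-group comparison, assuming R's built-in `PlantGrowth` data (chosen here only for illustration; it is not a data set from these slides). All object names are hypothetical. The F statistic is the ratio of the between-group mean square to the within-group mean square, and it should match what `anova()` reports for the equivalent linear model.

```r
# PlantGrowth ships with base R: plant `weight` in three `group`s
# (assumption: used purely as an illustrative data set)
data(PlantGrowth)

y <- PlantGrowth$weight
g <- PlantGrowth$group

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)    # mean of y within each group
group_n     <- tapply(y, g, length)  # number of observations per group

# Between-group sum of squares: group means around the grand mean
ss_between <- sum(group_n * (group_means - grand_mean)^2)

# Within-group sum of squares: observations around their own group mean
# (ave() returns the group mean for every observation)
ss_within <- sum((y - ave(y, g))^2)

df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)

# F compares the between-group and within-group mean squares
f_stat <- (ss_between / df_between) / (ss_within / df_within)
f_stat

# The same partition from the linear model machinery
anova(lm(weight ~ group, data = PlantGrowth))
```

A large F indicates that the variation between group means is large relative to the variation among individuals within groups.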
ANOVA how many ways?
One of the complications surrounding ANOVA is the convoluted nomenclature used to describe variants of the method
Variants are commonly distinguished by the number of categorical variables in the model
One-way ANOVA contains a single categorical variable
Two-way ANOVA contains two categorical variables
Three-way ANOVA contains three categorical variables
…
Two-way and higher ANOVA potentially involve the consideration of factor-factor interactions
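As a rough illustration of how these variants look as R model formulas, here is a sketch that assumes the built-in `ToothGrowth` data (not a data set from these slides), with `dose` converted to a factor so that both predictors are categorical. The object names are hypothetical.

```r
# ToothGrowth ships with base R: tooth length `len`, supplement `supp`
# (2 levels) and `dose` (3 distinct values, converted to a factor here)
data(ToothGrowth)
tooth <- transform(ToothGrowth, dose = factor(dose))

# One-way ANOVA: a single categorical predictor
one_way <- lm(len ~ supp, data = tooth)

# Two-way ANOVA, main effects only
two_way_main <- lm(len ~ supp + dose, data = tooth)

# Two-way ANOVA with the factor-factor interaction
# (supp * dose is shorthand for supp + dose + supp:dose)
two_way_int <- lm(len ~ supp * dose, data = tooth)

# Sequential ANOVA table: one row per term plus the residuals
anova(two_way_int)
```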
One-way ANOVA
In a one-way ANOVA we have a single categorical variable \(x\) with two or more levels
With two levels we have the same analysis as the t test
If we consider differences between animals of different breeds, we might use a breed factor whose levels might be
Danish Holstein
Red Danish
Jersey
If we’re testing the effect of parity, the factor might be parity with levels: 1, 2, 3, & 4+
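The equivalence between a two-level one-way ANOVA and the t test can be checked numerically. The sketch below assumes the built-in `ToothGrowth` data, whose `supp` factor has two levels (it is an illustrative stand-in, not data from these slides). The equal-variance t test is used because the linear model assumes a common residual variance.

```r
data(ToothGrowth)

# Two-sample t test assuming equal variances in the two groups
t_res <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)

# One-way ANOVA with the same two-level factor
aov_tab <- anova(lm(len ~ supp, data = ToothGrowth))

# The squared t statistic equals the ANOVA F statistic ...
unname(t_res$statistic)^2
aov_tab[["F value"]][1]

# ... and the two p-values agree
t_res$p.value
aov_tab[["Pr(>F)"]][1]
```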
One-way ANOVA
Assume we have a single categorical variable \(x\) with three levels. The one-way ANOVA model can be written using dummy coding (treatment contrasts)
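In its usual textbook form, with level 1 of \(x\) taken as the reference level, the model can be sketched as follows. The symbols \(d_{i2}\), \(d_{i3}\), \(\beta_j\) and \(\sigma^2\) are assumed notation for this illustration.

```latex
\[
  y_i = \beta_0 + \beta_1 d_{i2} + \beta_2 d_{i3} + \varepsilon_i,
  \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\]
\[
  d_{ij} =
  \begin{cases}
    1 & \text{if observation } i \text{ is in level } j \text{ of } x \\
    0 & \text{otherwise}
  \end{cases}
\]
```

Under this coding \(\beta_0\) is the mean of \(y\) in the reference level, while \(\beta_1\) and \(\beta_2\) are the differences between the means of levels 2 and 3 and the reference-level mean.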
bill_depth_mm ~ bill_length_mm * species
The effect of bill_length_mm in the reference species (Adelie, under R's default treatment contrasts) plus species-specific differences in the effect of bill_length_mm
bill_depth_mm ~ species / bill_length_mm
Directly estimates the effect of bill_length_mm within each species
bill_m3 <- lm(bill_depth_mm ~ species / bill_length_mm, data = penguins)
bill_m4 <- lm(bill_depth_mm ~ species / bill_length_mm - 1, data = penguins)
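To see what the `/` operator asks R to fit, the terms of a formula can be inspected without any data. This is a small sketch; the object names `nested`, `crossed` and `nested_no_int` are hypothetical.

```r
# Formulas can be expanded into their terms without fitting a model
nested  <- bill_depth_mm ~ species / bill_length_mm
crossed <- bill_depth_mm ~ bill_length_mm * species

# a / b expands to a + a:b, a separate bill_length_mm slope per species
attr(terms(nested), "term.labels")
# "species" "species:bill_length_mm"

# a * b expands to a + b + a:b
attr(terms(crossed), "term.labels")
# "bill_length_mm" "species" "bill_length_mm:species"

# Adding - 1 drops the intercept, so each species gets its own intercept term
nested_no_int <- bill_depth_mm ~ species / bill_length_mm - 1
attr(terms(nested_no_int), "intercept")
# 0
```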
Interaction?
summary(bill_m4)

Call:
lm(formula = bill_depth_mm ~ species/bill_length_mm - 1, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max
-2.6574 -0.6675 -0.0524  0.5383  3.5032

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
speciesAdelie                   11.40912    1.13812  10.025  < 2e-16 ***
speciesChinstrap                 7.56914    1.70983   4.427 1.29e-05 ***
speciesGentoo                    5.25101    1.33528   3.933 0.000102 ***
speciesAdelie:bill_length_mm     0.17883    0.02927   6.110 2.76e-09 ***
speciesChinstrap:bill_length_mm  0.22221    0.03493   6.361 6.55e-10 ***
speciesGentoo:bill_length_mm     0.20484    0.02805   7.303 2.06e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9548 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.997,	Adjusted R-squared:  0.9969
F-statistic: 1.858e+04 on 6 and 336 DF,  p-value: < 2.2e-16
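The nested and crossed formulas describe the same fitted model in two parameterisations. The sketch below illustrates this, assuming `penguins` comes from the palmerpenguins package (the slides do not say where the data are loaded from) and using hypothetical object names. The within-species slopes estimated directly by the nested model can be rebuilt from the crossed model as the reference slope plus the interaction differences.

```r
# Assumption: the `penguins` data are those in the palmerpenguins package
library(palmerpenguins)

m_crossed <- lm(bill_depth_mm ~ bill_length_mm * species, data = penguins)
m_nested  <- lm(bill_depth_mm ~ species / bill_length_mm - 1, data = penguins)

# Identical fitted values: one model, two parameterisations
all.equal(fitted(m_crossed), fitted(m_nested))

b <- coef(m_crossed)

# Within-species slopes = reference (Adelie) slope + interaction difference
slopes <- c(
  Adelie    = unname(b["bill_length_mm"]),
  Chinstrap = unname(b["bill_length_mm"] + b["bill_length_mm:speciesChinstrap"]),
  Gentoo    = unname(b["bill_length_mm"] + b["bill_length_mm:speciesGentoo"])
)
slopes

# The nested model reports the same slopes directly (coefficients 4 to 6)
coef(m_nested)[4:6]
```

Note that the R-squared printed for a model without an intercept is computed relative to zero rather than to the mean of the response, so the value of 0.997 for bill_m4 should not be compared with the R-squared of a model that includes an intercept.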
anova(bill_m1, bill_m2)

Analysis of Variance Table

Model 1: bill_depth_mm ~ bill_length_mm + species
Model 2: bill_depth_mm ~ bill_length_mm * species
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    338 307.20
2    336 306.32  2   0.87243 0.4785 0.6202
Models with main effects only and with main effects plus interaction
gain_m1 <- lm(gain ~ vitamin_1 + vitamin_2, data = pig_gain)
gain_m2 <- lm(gain ~ vitamin_1 * vitamin_2, data = pig_gain)
Pig weight gain?
anova(gain_m1, gain_m2)

Analysis of Variance Table

Model 1: gain ~ vitamin_1 + vitamin_2
Model 2: gain ~ vitamin_1 * vitamin_2
  Res.Df     RSS Df Sum of Sq     F Pr(>F)
1     17 0.20559
2     16 0.17648  1  0.029108 2.639 0.1238
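The F statistics in the two model-comparison tables can be reproduced by hand from the printed values. This is a sketch: `partial_f()` is a hypothetical helper, not a function from any package, and because the inputs are the rounded numbers from the tables the results agree only to the printed precision.

```r
# Hypothetical helper: F test for an interaction (or any extra terms)
# from the quantities printed by anova(small_model, big_model)
partial_f <- function(sum_sq, rss_big, df_num, df_den) {
  # Mean square gained by the extra terms over the residual mean square
  # of the larger model
  f <- (sum_sq / df_num) / (rss_big / df_den)
  p <- pf(f, df_num, df_den, lower.tail = FALSE)
  c(F = f, p = p)
}

# Penguins: interaction adds 2 parameters, 336 residual df in the larger model
penguin_test <- partial_f(sum_sq = 0.87243, rss_big = 306.32,
                          df_num = 2, df_den = 336)
penguin_test

# Pigs: interaction adds 1 parameter, 16 residual df in the larger model
pig_test <- partial_f(sum_sq = 0.029108, rss_big = 0.17648,
                      df_num = 1, df_den = 16)
pig_test
```

In both cases the F statistic is small and the p-value is well above conventional thresholds, so neither table provides much evidence for the interaction.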